Recommendation Systems in Libraries: an Application with Heterogeneous Data Sources
The Reading[&]Machine project leverages digitalization to increase the attractiveness of libraries and improve the users' experience. The project implements an application that supports users in their decision-making process, providing recommendation system (RecSys)-generated lists of books the users might be interested in, and showing them through an interactive Virtual Reality (VR)-based Graphical User Interface (GUI). In this paper, we focus on the design and testing of the recommendation system, employing data about all users' loans over the past 9 years from the network of libraries located in Turin, Italy. In addition, we use data collected by the Anobii online social community of readers, who share feedback and additional information about the books they read. Armed with these heterogeneous data, we build and evaluate Content-Based (CB) and Collaborative Filtering (CF) approaches. Our results show that CF outperforms the CB approach, improving the relevant recommendations provided to a reader by up to 47%. However, the performance of the CB approach depends heavily on the number of books the reader has already read, and CB can work even better than CF for users with a large reading history. Finally, our evaluations highlight that the performance of both approaches improves significantly if the system integrates and leverages the information from the Anobii dataset, which allows us to include more user readings (for CF) and richer book metadata (for CB).
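As a rough illustration of the collaborative-filtering side, the sketch below scores unseen books for a reader by the similarity-weighted loans of other readers. The interaction matrix, reader indices, and function names are invented for the example and are not taken from the project.

```python
import numpy as np

# Toy user-item interaction matrix (rows: readers, cols: books); 1 = borrowed.
# Illustrative data only -- the paper uses 9 years of Turin library loans.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0],
], dtype=float)

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero rows
    U = M / norms
    return U @ U.T

def cf_scores(R, user):
    """User-based CF: score books by similarity-weighted loans of other users."""
    sim = cosine_sim(R)[user].copy()
    sim[user] = 0.0                 # exclude the user themself
    scores = sim @ R
    scores[R[user] > 0] = -np.inf   # never recommend already-read books
    return scores

def recommend(R, user, k=2):
    scores = cf_scores(R, user)
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

print(recommend(R, user=1))  # books 1 and 4 for reader 1 with this toy matrix
```

A content-based counterpart would instead compare a TF-IDF vector of each book's metadata against the profile built from the reader's own history, which is why it needs a sizeable per-user history to work well.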
Machine learning supported next-maintenance prediction for industrial vehicles
Industrial and construction vehicles require tight periodic maintenance operations. Their schedule depends on vehicle characteristics and usage. The latter can be accurately monitored through various on-board devices, enabling the application of Machine Learning techniques to analyze vehicle usage patterns and design predictive analytics. This paper presents a data-driven application to automatically schedule the periodic maintenance operations of industrial vehicles. It aims to predict, for each vehicle and date, the actual remaining days until the next maintenance is due. Our Machine Learning solution is designed to address the following challenges: (i) the non-stationarity of the per-vehicle utilization time series, which limits the effectiveness of classic scheduling policies, and (ii) the potential lack of historical data for those vehicles that have recently been added to the fleet, which hinders the learning of accurate predictors from past data. Preliminary results collected in a real industrial scenario demonstrate the effectiveness of the proposed solution on heterogeneous vehicles. The system we propose here is currently under deployment, enabling further tests and tuning.
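The core prediction task, estimating for each vehicle and date the remaining days until the next maintenance, can be framed as a plain regression problem. The sketch below uses ordinary least squares on two made-up features; the feature choice, data, and model are illustrative assumptions, since the abstract does not specify the regressors used.

```python
import numpy as np

# Hypothetical per-vehicle observations: [avg daily utilization hours,
# days since last maintenance]; target: actual remaining days until the
# next maintenance is due. All values are synthetic.
X = np.array([
    [2.0, 10], [2.0, 60], [5.0, 10], [5.0, 40],
    [8.0,  5], [8.0, 20], [3.0, 30], [6.0, 15],
], dtype=float)
y = np.array([110, 60, 38, 8, 25, 10, 66, 26], dtype=float)

# Ordinary least squares with an intercept column: a deliberately simple
# stand-in for whatever regressors the deployed system actually uses.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_remaining_days(util_hours, days_since):
    """Predicted days until the next maintenance for one vehicle snapshot."""
    x = np.array([util_hours, days_since, 1.0])
    return float(x @ coef)

print(predict_remaining_days(2.0, 10))  # lightly used, recently serviced
print(predict_remaining_days(8.0, 20))  # heavily used, longer since service
```

The non-stationarity and cold-start challenges the abstract mentions are exactly what this naive global fit ignores: a production system would retrain per vehicle over a sliding window and fall back to fleet-level models for newly added vehicles.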
Heterogeneous industrial vehicle usage predictions: A real case
Predicting future vehicle usage based on the analysis of CAN bus data is a popular data mining application. Many of the usage indicators, like the utilization hours, are non-stationary time series. To predict their values, recent approaches based on Machine Learning combine multiple data features describing engine status, travels, and roads. While most of the proposed solutions address car and truck usage prediction, a smaller body of work has been devoted to industrial and construction vehicles, which are usually characterized by more complex and heterogeneous usage patterns. This paper describes a real case study performed on a 4-year CAN bus dataset collecting usage data about 2,250 construction vehicles of various types and models. We apply a statistics-based approach to select the most discriminating data features. Separately for each vehicle, we train regression algorithms on historical data enriched with contextual information. The achieved results demonstrate the effectiveness of the proposed solution.
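A minimal sketch of a statistics-based feature selection step, assuming, purely as an illustration, that candidate features are ranked by absolute Pearson correlation with the usage target; the feature names and synthetic data are invented, and the paper does not state which statistic it uses.

```python
import numpy as np

# Synthetic CAN-bus-derived indicators; names are illustrative only.
rng = np.random.default_rng(0)
n = 200
engine_hours = rng.uniform(0, 10, n)
idle_ratio = rng.uniform(0, 1, n)
noise = rng.normal(0, 1, n)
# Target (e.g. next-day utilization hours) built so that engine_hours is
# the strongest signal, idle_ratio a weaker one, and noise irrelevant.
target = 3.0 * engine_hours - 20.0 * idle_ratio + 0.1 * noise

features = {"engine_hours": engine_hours,
            "idle_ratio": idle_ratio,
            "noise": noise}

def rank_by_correlation(features, target):
    """Rank features by absolute Pearson correlation with the target."""
    scores = {name: abs(np.corrcoef(col, target)[0, 1])
              for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_by_correlation(features, target))
```

Only the top-ranked features would then feed the per-vehicle regressors, keeping each small model from overfitting its limited history.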
On using pretext tasks to learn representations from network logs
Learning meaningful representations from network data is critical to ease the adoption of AI as a cornerstone to process network logs. Since a large portion of such data is textual, Natural Language Processing (NLP) appears as an obvious candidate to learn their representations. Indeed, the literature proposes impressive applications of NLP applied to textual network data. However, in the absence of labels, objectively evaluating the goodness of the learned representations is still an open problem. We call for a systematic adoption of domain-specific pretext tasks to select the best representation from network data. Relying on such tasks enables us to evaluate different representations on side machine learning problems and, ultimately, to unveil the best candidate representations for the more interesting downstream tasks for which labels are scarce or unavailable. We apply pretext tasks in the analysis of logs collected from SSH honeypots. Here, a cumbersome downstream task is to cluster events that exhibit a similar attack pattern. We propose the following pipeline: first, we represent the input data using a classic NLP-based approach. Then, we design pretext tasks to objectively evaluate the representation goodness and to select the best one. Finally, we use the best representation to solve the unsupervised task, which uncovers interesting behaviours and attack patterns. All in all, our proposal can be generalized to other text-based network logs beyond honeypots.
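The pipeline above, represent the logs, then use a pretext task to pick the best representation, can be sketched as follows. Everything here is an assumption for illustration: the toy log lines, the failure/non-failure pretext label, and the leave-one-out nearest-neighbour scoring are not the paper's actual design.

```python
import numpy as np

# Toy SSH-honeypot-style log lines; purely illustrative.
logs = [
    "failed password for root from 10.0.0.1",
    "failed password for admin from 10.0.0.2",
    "accepted password for user from 10.0.0.3",
    "session opened for user root",
    "session closed for user admin",
    "accepted publickey for user from 10.0.0.4",
]
# Pretext label derivable from the data itself: does the event report a failure?
pretext_y = np.array([1, 1, 0, 0, 0, 0])

def bag_of_words(texts):
    """Classic NLP baseline: term-count vectors over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(texts), len(vocab)))
    for r, t in enumerate(texts):
        for w in t.split():
            M[r, idx[w]] += 1
    return M

def pretext_score(X, y):
    """Leave-one-out nearest-neighbour accuracy on the pretext labels."""
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # never match an event with itself
        correct += y[np.argmin(d)] == y[i]
    return correct / len(X)

# Candidate representations: bag-of-words vs. a degraded line-length feature.
X_bow = bag_of_words(logs)
X_len = np.array([[len(t)] for t in logs], dtype=float)
best = max([("bow", X_bow), ("length", X_len)],
           key=lambda p: pretext_score(p[1], pretext_y))
print(best[0])
```

The representation that wins on the pretext task is then the one handed to the unsupervised downstream clustering, where no labels exist to arbitrate.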
Characterizing Web Pornography Consumption from Passive Measurements
Web pornography represents a large fraction of the Internet traffic, with thousands of websites and millions of users. Studying web pornography consumption allows us to understand human behaviors and is crucial for medical and psychological research. However, given the lack of public data, these works typically build on surveys, limited by different factors, e.g., unreliable answers that volunteers may (involuntarily) provide. In this work, we collect anonymized accesses to pornography websites using HTTP-level passive traces. Our dataset includes about 15,000 broadband subscribers over a period of 3 years. We use it to provide quantitative information about the interactions of users with pornographic websites, focusing on time and frequency of use, habits, and trends. We distribute our anonymized dataset to the community to ease reproducibility and allow further studies.
Comment: Passive and Active Measurement Conference 2019 (PAM 2019). 14 pages, 7 figures.
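The time-and-frequency statistics described above can be sketched with a few aggregations over hypothetical anonymized trace records; the record schema and field names below are assumptions for illustration, not the study's actual data format.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical anonymized trace records: (hashed subscriber id, timestamp,
# site category). Values and schema are invented for the example.
records = [
    ("u1", "2019-01-01T22:10:00", "porn"),
    ("u1", "2019-01-01T23:05:00", "porn"),
    ("u1", "2019-01-03T21:30:00", "porn"),
    ("u2", "2019-01-02T13:00:00", "porn"),
]

def visits_per_hour_of_day(records):
    """Aggregate accesses by hour of day, a typical time-of-use statistic."""
    hist = defaultdict(int)
    for _, ts, _ in records:
        hist[datetime.fromisoformat(ts).hour] += 1
    return dict(hist)

def active_days_per_user(records):
    """Number of distinct days on which each subscriber accessed the sites."""
    days = defaultdict(set)
    for user, ts, _ in records:
        days[user].add(ts[:10])  # date prefix of the ISO timestamp
    return {u: len(d) for u, d in days.items()}

print(visits_per_hour_of_day(records))
print(active_days_per_user(records))
```

Working on pre-aggregated, hashed records like these is what lets such a dataset be shared for reproducibility without exposing individual subscribers.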